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While much attention has been paid to the vulnerability of computer networks to node and link failure, there 
is limited systematic understanding of the factors that determine the likelihood that a node (computer) is 
compromised. We therefore collect threat log data in a university network to study the patterns of threat 
activity for individual hosts. We relate this information to the properties of each host as observed through 
network-wide scans, establishing associations between the network services a host is running and the kinds 
of threats to which it is susceptible. We propose a methodology to associate services to threats inspired by the 
tools used in genetics to identify statistical associations between mutations and diseases. The proposed 
approach allows us to determine probabilities of infection directly from observation, offering an automated 
high-throughput strategy to develop comprehensive metrics for cyber-security. 

Given the extent to which critical infrastructures and day-to-day human and economic activity are 
dependent on secure information and communication networks, we must understand their weaknesses 
and uncover the risks that threaten their integrity 13 . Much work has been devoted in the past years to 
understanding the global risks resulting from node removal in communication and infrastructure networks 3 6 , or 
posed by viruses and malware 7 . Less is known, however, about the specific risks that pertain to compromising 
single nodes. 

As cyber-threats proliferate both in their volume and complexity, new strategies are necessary to prevent 
intrusions and to cope with their impact 8 . Current research in cyber-security focuses on characterizing and 
modelling specific attacks, aiming to understand the mechanisms of infiltration, detection and mitigation 9-17 . 
Despite significant efforts in this direction, the number of attacks continues to increase each year 1819 . The 
constantly increasing vulnerabilities and intrusions have lead to a paradigm shift in security: it is now widely 
recognized that absolute security is unattainable 20,21 . While very effective methods to detect and aggregate 
software vulnerabilities have been developed, determining actual risks and probabilities of compromise largely 
relies on expert opinion and heuristics 22 25 . Instead, we need systematic methods to estimate the magnitude of the 
potential risk by determining the probability that a system is compromised and the risk associated with specific 
settings and configurations 26 . Understanding the nature and the components of the current risks could help us 
devote resources to most efficiently reduce the chance of intrusion. 

In this work we introduce an unbiased, context-free and high-throughput statistical framework to evaluate the 
susceptibility of a computer to individual threats. By combining infection data from threat logs with network 
scans we identify correlations between the network services detected on a host and the threats affecting it. The 
proposed method offers a systematic framework to identify new attack vectors and to evaluate the role of 
individual factors contributing to the susceptibility to each threat. 

Data collection 

To understand the recurring threats experienced by a large computer network, we collected the threat logs of an 
Intrusion Detection and Prevention System (IDS/IPS) protecting a university network of approximately 30, 000 
hosts. The IDS/IPS monitors incoming and outgoing traffic, searching for matching signatures from a threat 
database. Each positive match is logged with a time stamp, source and target IP addresses as well as the name 
and identification number of the detected threat. A small fraction of the detected threats (—3.2%) originate within 
the network. Since the threats in the database correspond to well documented malicious activity and constitute 
violations of security policies established by the network administrators, we assume that a threat originating within 
the University network is evidence that the corresponding host at the source IP address has been compromised. Our 
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Figure 1 | Correlating threats and services, (a) Pairs of hosts running similar services have an increased likelihood of being compromised by the same 
threats. The plot shows the mean value of the correlation between threat profiles as a function of the correlations between services for all pairs of 
hosts, indicating that the likelihood is maximal for computers running the same services. The blue line represents the expectation value for random 
resampling of the data. As the Pearson correlation is not defined for a constant signal, the plot excludes hosts for which no threats or no services have been 
found, (b) Odds ratios for the most common threats in the presence of three services: ssh, flexlm and ftp. The presence of certain services drastically 
increases the odds of being compromised by specific threats. 



threat-dataset consists of 501 days of logs produced by the IDS/IPS 
beginning in September 2012, with approximately 16 million entries. 

To remotely gather information about hosts in the network, we 
also performed full network scans of all IP addresses within the 
university, obtaining the list of network services running on all hosts 
found. A total of 11, 726 hosts were successfully scanned, identifying 
464 distinct services. In this sample, 10, 560 hosts were found in the 
logs to have been compromised by at least one of 278 distinct threats. 

Results 

Correlations between services and threats. Our starting hypothesis 
is that the probability that a host is compromised by a specific threat 
is largely determined by the set of reachable services running on it 27 . 
Indeed, many attacks take advantage of vulnerabilities in specific 
network services to bypass authentication and gain access 28 . 
Additionally, certain types of malware may autonomously scan for 
services to identify potential targets, or even actively open ports after 
infection. Furthermore, the presence of specific services on a host can 
be indicative of the ways in which the host is used, its intended 
purpose or its user's security awareness, allowing us to estimate the 
host's susceptibility to specific threats. 



Evidence for this hypothesis is provided in Figure la, showing the 
correlations between services and threats. We define the threat pro- 
file of host i as vector v', where v' k represents the fraction of log entries 
for host i that correspond to threat k, and its service conformation as 
s', where sj is the number of instances of service / running on host i. 
We calculate p(v', V) in function of p(s', sO for all pairs of hosts i and;', 
where p is the Pearson correlation. The blue line is the expectation 
value obtained under randomization of the data that removes all 
potential correlations between services and threats. We find that, 
on average, the correlation between threat profiles for most host pairs 
remains around the values expected in the random case. However, 
for host pairs with high correlation between their services (p(s', sf) > 
0.8) the average correlation between their threat profiles increases 
sharply. This means that hosts running similar services have a sig- 
nificantly increased likelihood of being compromised by the same 
threats. This likelihood is maximal for hosts with identical services 
(PCs', sO = 1). 

Figure lb shows three examples supporting the hypothesis that the 
presence of specific network services is a determining factor for the 
susceptibility to certain threats. Indeed, we find that a computer 
running the ssh service is 16 times more likely to be flagged for the 
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Active Directory RDN Parsing Memory corruption than computers 
without that service. More dramatically, a host running the ftp ser- 
vice is 678 times more likely to trigger alerts for Chrome's FTP PWD 
response vulnerability, and hosts running flexlm are about 370 times 
more likely to be infected by the JBoss Worm (CVE-20 10-0738 in the 
Common Vulnerabilities and Exposures Database, www.cve.mitre. 
org). 

Case-control study. The ability of a threat to compromise a host 
depends on a combination of factors, from the software running on 
the host and its level of exposure to the user's behavior and security 
awareness, to name only a few. Yet, the finding that hosts running 
similar services are susceptible to similar threats suggests that 
significant aspects of these factors are represented in the specific 
combination of network services present on a host. Therefore, to 
understand a system's risks we must systematically untangle the 
correlations between individual services and threats. This is 
traditionally done through domain experience and contextual 
knowledge — system administrators must constantly stay informed 
about current trends pertaining to their system and the type of 
services they run. However, given the rapidly evolving complexity 
of both threats and services, this is a difficult and inefficient task. 
Next we show that we can automate this process by relying on tools 
borrowed from genetics. 

We structure our analysis as a standard case-control study. We 
consider a binary state model in which a given threat has either been 
detected on a host or not, and the host is either running a given 
network service or not. Our case-hosts for the study of a specific 
threat are therefore all the hosts on which that threat has been 
detected by the IDS/IPS. All hosts on which the threat has not been 
observed serve as the control group. 

For a given threat k, the group of case-hosts Aj- that have been 
infected by the threat and the group of control hosts Q- are given by 

A* = {/:v[>0} (1) 



C* = {i:v<=0}. (2) 

For each service these two groups are split into hosts that are running 
the service and hosts that are not. Since we consider each service as a 
risk factor, we refer to hosts running a specific service as exposed. For 
a given service / under scrutiny, the different groups Ak = * + \JA\_ 
and C/c = C; UCf are given by 

A k + ={ieA k :s\>0} (3) 



A k _ = {ieA k :sj = 0} (4) 

for the exposed and unexposed case-hosts, respectively, and 

C k + ={ieC k :s\>0} (5) 



C\ ={ieC k :sj = 0} (6) 

for the exposed and unexposed controls, where the symbols + and — 
correspond to the service running and not running on the hosts, 
respectively. With this, we construct a 2 X 2 contingency table as 
shown in Table I. 

Determining whether service I is significantly associated with 
threat k requires us to evaluate the statistical differences between 
the two columns of this table. This association problem is equivalent 
to that encountered in genetic epidemiology, where the goal is to 
assess the impact of specific genetic mutations on complex diseases 29 , 
a problem addressed using genome-wide association studies 



Table 1 | Case-control framework. Contingency table forth 
ation of an association between the presence of a given 
service and the infection by a given threat 


e evalu- 
network 


running the service not running tf 


e service 


infected by the threat 
threat not detected in host 


A l+ 




At 
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(GWAS) 30 . In these studies, a large number of genetic markers along 
the entire genome are compared between a population of healthy and 
disease-affected individuals. GWAS offers a high-throughput 
method for the unbiased identification of mutation-genotype 
associations. 

We equate the set of all scanned network services to the set of 
chosen mutations (biomarkers) in a GWAS. The presence of a given 
service on a host corresponds to the presence of a certain allele 
(genetic variant) for a given gene. Therefore, our problem has an 
identical mathematical formulation to the problem of genetic asso- 
ciation, in which case patients play the role of hosts that have been 
compromised by a specific threat of interest. A statistical association 
is then evaluated by calculating the statistical significance (p-value) 
of the discrepancy between the distribution of variants for each gene 
in the population of individuals with the disease and the distribution 
of the same variants in a population of healthy individuals 31 . 

In our model, we measure the risk posed by a service, regardless of 
how many instances are running on a given host. In genetics, this 
corresponds to a common dominant model of genetic penetrance, 
such that the presence of a given (dominant) variant of a gene is 
sufficient to cause an associated phenotype, regardless of the number 
of copies present. We resort to the Fisher exact test to evaluate the 
statistical significance of the associations between threats and ser- 
vices. In the case of GWAS, logistic regressions are also commonly 
used to evaluate statistical significance when the considered genetic 
model allows for different values of the penetrance (e.g. additive or 
multiplicative) for the different possibilities of homozygous and het- 
erozygous cases. The Fisher test used here assumes that the distribu- 
tions of exposed and unexposed hosts are statistically identical for 
both the case and the control groups. In other words, all computers 
are equally likely to be compromised by the threat, regardless of 
whether they are running or not the considered service. Therefore, 
any discrepancies in the proportions of compromised hosts between 
exposed and non-exposed hosts should be due to chance. The test 
provides the probability of observing the proportions measured in 
the data under the null hypothesis. The outcome is the p-value for the 
significance of the association, which, if smaller than an established 
threshold, indicates that the null hypothesis is false. Therefore, in this 
case there is a statistically significant difference in the likelihood of 
falling victim to threat k between hosts that are running service / and 
hosts that are not. 

This allows us to construct a standard Manhattan plot for each 
threat, illustrated in Fig. 2 for three different threats. The plots show 
the negative logarithm of the p-value for each service with the thresh- 
old for significance set at 0.01 , multiplied by an additional Bonferroni 
correction. Each dot is colored according to the sign of the correla- 
tion: red dots above the threshold indicate a positive association 
between the service and the threat; green dots denote negative asso- 
ciations. The resulting plot summarizes the risk profile of a threat. 

Positive associations imply that the service in question constitutes 
a risk factor for computers running the specific service. This could be 
the result of one of two possible scenarios: a) direct association, in 
which the service is directly responsible for the host being compro- 
mised, b) indirect association, where the presence of the associated 
service correlates with the factors that are the true cause of the host's 
infection (note that associations can also be false-positives due to 
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Figure 2 | Using GWAS to identify threat-service associations. Manhattan plots for three threats. For each service along the horizontal axis, the negative 
logarithm of the association p-value is displayed vertically. The dotted line denotes the threshold for statistical significance. The color of each dot 
corresponds to the sign of the logarithm of the corresponding odds ratio. Red dots above the threshold represent the services that are positively associated 
with the corresponding threat. Green dots indicate that the service's absence correlates with the threat. The ordering of the services along the horizontal 
axis is aleatory. 



systematic biases in the data. We find no particular reason to suspect 
that this is at play in our data). Negative associations indicate that the 
service is significantly unlikely to be present on compromised hosts. 
In such cases, it is the absence of the service which correlates with the 
cause of the intrusion. Hence, the specific service may play an effec- 
tive "protective" role. 

Since network services represent a small subset of a system's char- 
acteristics, some of the identified associations will be indirect. An 
example is given by the JBoss Worm in Figure 2b. This threat affects 
the Red Hat Java Application Server, which typically runs behind a 
web server. Accordingly, the service http is present in an average of 
2.55 instances in compromised hosts, while only 0.4 times in those 
unaffected, vmware-auth and flexlm, which are respectively 23 and 
164 times more likely to be present in infected computers, also indi- 
cate that these hosts are part of a virtualized network that makes use 
of the application server. Even though the presence of the ossec-agent 
(a host-based intrusion detection system 354 times more likely to be 
found in compromised hosts) is intended to mitigate intrusions, it is 
also an indication that these are servers maintained by administra- 
tors, consistent with the presence of a JBoss Application Server. The 
association of all these services with the JBoss worm at high signifi- 
cance constitutes an indirect association through their co-occurrence 
with the actual causal factor. 

Nevertheless, given that network services are the point of entry for 
many viruses and network intrusions, we expect many of these asso- 
ciations to be direct. For example, the threat ASP.NET NULL Byte 
Injection Information Disclosure Vulnerability (CVE-2007-0042, 
CVE-201 1-3416) shows high significance for the association with 



the remoting service, which mediates communications in the .NET 
Framework. Since this service is used by ASP.NET for communica- 
tions between server and client, it could plausibly provide an entry 
point for the potential attack, and thus constitute a direct association. 
Since there are alternative setups for this system (for example, using 
Windows Communication Foundation) blocking traffic for this ser- 
vice could have a measurable impact on the network's susceptibility 
to this threat, which would provide experimental evidence for this 
association. 

Validation 

The group of services whose p-value for a given threat lay above the 
statistical threshold are risk factors, implying that hosts running 
these services have a significantly higher likelihood of being compro- 
mised by the threat. To assess the predictive power of the proposed 
procedure, we divide our hosts into two cohorts of equal size, chosen 
randomly. We use one cohort as a calibration sample, identifying the 
risk factors (services) associated with each threat. The other cohort is 
used to measure the effect of running the associated risk-factor ser- 
vices on the likelihood of being compromised by each threat. 
Specifically, after obtaining the list of associated risk-factor services 
from the calibration cohort, we identify the hosts in the test cohort 
which are running at least one of the associated services (exposed 
hosts). We then compare the proportion of hosts that have been 
compromised by the threat amongst the exposed hosts with the 
proportion amongst unexposed hosts. This allows us to establish 
the probability that the observed proportions stem from the same 
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Figure 3 | Validating the GWAS predictions. Hosts running risk factor services are predicted to have significantly increased likelihoods of infection. 
These predictions are also valid for hosts that were excluded from the calculations. The plot shows the odds ratios corresponding to the predicted 
vulnerable hosts for each threat, measured in both the calibration and the test cohorts. The agreement between the two cohorts indicates that the 
information obtained from one sample can be used to make reliable predictions about other computers. 



distribution. The resulting p-value corresponds to the probability 
that any distinction between exposed and unexposed hosts in the 
test cohorts are due to chance, or equivalently, the probability that 
the associated services obtained have no predictive power over the 
probability of infection. As an example, consider the case of the JBoss 
worm. Analyzing the calibration cohort, we find that the services 
most significantly associated with the worm are vmware-auth, net- 
bios-ssn, flexlm and ossec-agent. In the test cohort, 641 hosts are 
running at least one of these three services, 119 of which (18.5%) 
were compromised by the threat. Amongst the remaining 5222 hosts 
that are not running any of these services, only 104 (1.9%) were 
compromised by the threat, a factor of ten difference. Thus, with a 
p-value of 2.44 X 10~ 58 , hosts running any of these three services 
have significantly increased likelihood of being infected by the JBoss 
worm, indicating the effectiveness of the method to detect susceptible 
hosts. 

For 24 threats we successfully identified at least one risk-factor 
service, finding that 17 of these show statistical significance in the 
validation. The extent to which corresponding predictions about the 
incidence of threats in exposed hosts are reliable can be evaluated by 



comparing the odds ratios in both cohorts for each group of exposed 
hosts. In Figure 3 we show the odds ratio of each threat as measured 
in each cohort. In both cases, exposed hosts are those running at least 
one of the associated risk-factor services, obtained using data from 
the calibration cohort only. Although some variability between 
cohorts is observed, there is good agreement between the two 
cohorts, meaning that predictions made on the basis of one group 
of hosts can be carried over reliably to other hosts. 

Implementations 

As we have shown above, the proposed statistical framework can be 
used to make predictions about the threat profile of a computer. If a 
host is automatically scanned when it connects to the network, the 
method allows us to immediately determine the list of threats by 
which the computer has an increased likelihood of being compro- 
mised. Therefore, packet inspection and traffic control can be tai- 
lored accordingly, offering a more efficient allocation of resources. 

As for prophylactic measures at the network level, this method can 
be used to design firewall rules with reliable information about their 
potential effects on the probabilities of compromise. By identifying 
all the hosts running a high risk-factor service, one can concentrate 
resources on enacting stringent controls over a minimal number of 
hosts to obtain a maximal reduction in threat incidence. For the 
example of the JBoss worm, we identify the service ossec-agent as 
the most significantly associated risk-factor and having the highest 
odds ratio. This is a surprising finding given that the purpose of this 
software is to protect the host from intrusions, and it illustrates how 
this approach can help find unexpected correlations. Only 1.9% of 
the computers in the network run this service. Ensuring security 
against the JBoss Worm for these hosts alone could reduce the incid- 
ence of this threat by as much as 47%. Considering combinations of 
services allows us to fine-tune the degree of acceptable interventions 
to obtain a desired threat-reduction. For example, in Table II we 
show the percentage of computers one needs to effectively secure 
to attain different percentages of reduction in incidence of the 
JBoss worm if we consider hosts that are running a minimum num- 



Table II | Host protection and threat reduction. If we can ensure the 
safety of a subset of the network by stringent control (or limit risk by 
blocking access to the hosts), we can calculate the resulting reduc- 
tion in the incidence of a given threat. The desired degree of reduc- 
tion in threat incidence determines a threshold for selecting hosts 
according to the risk factor services they run. This table shows the 
expected reductions in the incidence of the JBoss worm by means of 
different selection criteria 



Number of risk-factors 


Covered hosts 


Threat reduction 


at least 1 


66.6% 


98.9% 


at least 5 


14% 


79.5% 


at least 9 


2.1% 


45.1% 


at least 1 2 


1 .7% 


40% 
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ber of associated risk-factor services. Adopting a selection criterion 
carries a trade-off between the scope of an intervention and the 
resulting threat reduction. 

Of course, infallible prevention can hardly be achieved without 
resorting to drastic interventions, such as removing or otherwise 
isolating the host from the network. With a better understanding 
of the causality between threats and services, much less intrusive 
measures could be implemented with comparable results. 

Discussion 

In this paper we introduced a novel way to identify and quantify the 
susceptibility of individual computers to cyber-threats in terms of the 
network services that they run. We do so by establishing a direct 
mapping between the threat susceptibility problem and that of iden- 
tifying statistical associations between genes and diseases in humans. 
The adopted approach relies on the methodologies developed in 
GWAS, aiming to associate genes with specific diseases. 

Efficient and sophisticated tools to detect software vulnerabilities 
in a host are already available and used widely. However, they are 
limited to known and well-documented vulnerabilities, and their 
ability to evaluate the risks of compromise rely on heuristic metrics. 
The statistical approach developed here allows us to identify associa- 
tions without preconceived assumptions about the underlying 
mechanisms. We can therefore explore the whole range of the vari- 
ables that we consider (in this case, network services) in search for 
risk factors that may have been overlooked before. Such "agnostic", 
data-driven approaches are particularly valuable for making predic- 
tions in real-world scenarios, as they bypass the need for mechanistic 
explanations and contextual knowledge. Since we consider effective 
frequencies of infection in vivo, not only technical factors, but also 
behavioral, environmental and exogenous factors are automatically 
taken into account. For this reason, the proposed method offers 
reliable predictions about the effects of specific changes in the net- 
work in real-world environments. 

The basis for this method resides in the observation that the net- 
work services running on a host play a defining role in its suscept- 
ibility to certain threats. This is not to deny that other variables may 
be equally important. Note, however, that our framework is not 
limited to services: if the appropriate data on other risk factors 
becomes available (e.g., operating system, accounts and privileges, 
hardware and peripheral devices, applications, drivers, manufac- 
turer, etc.), our method can determine their role for specific threats. 
Nevertheless, it is remarkable that such high statistical significance is 
observed using only services as risk factors. In contrast to other 
methods to assess risks that involve in-depth scanning, for which 
the collection of data can often be intrusive and even disruptive, the 
kind of data that our approach requires can be easily and system- 
atically collected. 
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